Abstract

The inferior temporal cortex (IT) of monkeys is thought to play an essential role in visual object recognition.
Inferotemporal neurons are known to respond to complex visual stimuli, including patterns like faces,
hands, or other body parts. What is the role of such neurons in object recognition? The present study
examines this question in combined psychophysical and electrophysiological experiments, in which monkeys
learned to classify and recognize novel visual 3D objects. A population of neurons in IT was found to
respond selectively to such objects that the monkeys had recently learned to recognize. A large majority
of these cells discharged maximally for one view of the object, while their response fell off gradually as the
object was rotated away from the neuron’s preferred view. Most neurons also exhibited orientation-dependent
responses during view-plane rotations. Some neurons were found to be tuned around two views of the
same object, while a very small number of cells responded in a view-invariant manner. For five different
objects that were extensively used during the training of the animals, and for which behavioral performance
became view-independent, multiple cells were found that were tuned around different views of the
same object. No selective responses were ever encountered for views that the animal systematically failed
to recognize. The results of our experiments suggest that neurons in this area can develop a complex
receptive field organization as a consequence of extensive training in the discrimination and recognition of
objects. Simple geometric features did not appear to account for the neurons’ selective responses. These
findings support the idea that a population of neurons - each tuned to a different object aspect, and each
showing a certain degree of invariance to image transformations - may, as an assembly, encode complex
3D objects. In such a system, several neurons may be active for any given vantage point, with a single
unit acting like a blurred template for a limited neighborhood of a single view.


Copyright  ©  Massachusetts  Institute  of  Technology,  1994 


This paper describes research done at the Baylor College of Medicine, and the Center for Biological and Computational
Learning in the Department of Brain and Cognitive Sciences at the Massachusetts Institute of Technology. Nikos K. Logothetis
was supported by the contract N00014-93-1-0209 of the Office of Naval Research (1992) and the McKnight Endowment Fund
for Neuroscience (1993). Tomaso Poggio was supported by the Office of Naval Research contract N00014-93-1-0385, and by
the NSF grant ASC-92-17041.




1  Introduction 

Object recognition can be thought of as the process of
matching the image of an object to its representation
stored in memory. Because different viewing, illumination,
and context conditions generate different retinal
images, the nature of the stored representation and the
process of normalization of the sensory input present
one of the greatest challenges to understanding biological
recognition. It is well known that familiar objects
are recognized regardless of viewing angle, scale or
position in the visual field. How is such perceptual object
constancy accomplished? Does the brain transform the
sensory or the stored representation to discard the image
variability resulting from different viewing conditions, or
does generalization occur as a consequence of perceptual
learning, that is, of being acquainted with different
instances of any given object? The present paper addresses
one aspect of this issue, namely, how the primate
recognition system may compensate for changes in viewing
angle and distance, ignoring the image changes resulting
from variation of the illumination and context.
Moreover, the issue is addressed at the level of subordinate
categorizations of objects.

Studies indicate that objects can be identified at a
number of levels of abstraction, but are most easily
recognized at what is referred to as the basic level (Rosch
et al., 1976). For instance, a barn swallow is perceived
first as a bird, rather than as a swallow or an Avian.
Classifications above the basic level are more general
and are called superordinate. In contrast, subordinate-level
classifications lie below the basic level and
are more specific, their members sharing a great number
of attributes with other members of the object class. The
behavioral performance of humans for subordinate classifications is
strongly view dependent (Rock and DiVita, 1987; Tarr
and Pinker, 1990; Edelman and Bülthoff, 1992),
presumably because it largely relies on the recognition of
subtle differences in the shape of complex objects. It
is also this type of classification that is most seriously
impaired by circumscribed damage to the human cerebral
cortex (Damasio, 1990). It appears that, at least in
humans, distinct shape differences may be the basis for
reliable object recognition under any viewing conditions.
Objects with distinct shapes are recognized most easily and
most quickly, whether at the basic level or not. For instance, a
penguin, i.e. an atypical exemplar of the basic-level category
birds, is most likely to be first recognized as “penguin”
rather than as a “bird”, a classification termed entry
level recognition (Jolicoeur et al., 1984). Penguins do
indeed have a distinct shape when compared with most
other animals, but also differ a great deal from any other
bird.

Conceptual hierarchies like those mentioned above
reflect certain types of interactions between the human
perceiver and objects in the environment. As such they
also reflect the “default” probabilities of the required
discriminations for any given class of objects. Thus in a
domain of expertise, subordinate-level categories may be
as differentiated as the basic-level categories, and the
former categorizations may be as fast as the latter (Tanaka
and Taylor, 1991). Clearly, in the nonhuman primate,
categories have no bearing on language. Nonetheless,
there is little doubt that monkeys are capable of
categorizations of objects like predators, prey, infant
monkeys, or food: categories of objects usually having distinct
shape differences. It has also been shown that monkeys
can be trained to be “experts” in discriminations of
objects of a novel class, the members of which share great
shape similarities (Logothetis et al., 1994). It is this
latter type of object discrimination that was used to study
the spatial reference system of object representations in
the non-human primate and the activity of neurons in
the temporal cortex during the execution of the
recognition task.

The reference system used in matching object shapes
to their representations encoded in visual memory is a
key question in the research of visual object recognition
(Farah, 1985; Ullman, 1989; Tarr and Pinker, 1989).
Theories relying on object-centered representations
assume either a complete three-dimensional description
of an object (Ullman, 1989), or a structural
description of the image that specifies the relationships among
viewpoint-invariant volumetric primitives (Marr, 1982;
Biederman, 1987). Whereas such theories correctly
predict the view-independent recognition of familiar objects
(Biederman, 1987), they fail to account for performance
in recognition tasks with novel objects at the
subordinate level (Rock & DiVita, 1987; Rock et al., 1981; Tarr
& Pinker, 1990; Bülthoff and Edelman, 1992; Edelman
& Bülthoff, 1992). Viewpoint-dependent, image-based
models, on the other hand, represent three-dimensional
objects as a set of 2D views, or aspects, and recognition
consists of matching image features against the views in
this set.

Although such models can account for the
performance of human subjects in any recognition task, they
are usually considered implausible because of the
memory a system would require to store all discriminable
views of many objects. These objections, however, have
recently been challenged by computer simulations
showing that a simple network can recognize 3D objects by
interpolating between a small number of stored views
(Poggio and Edelman, 1990; Logothetis et al., 1994).
This network (Figure 1) uses a small set of sparse data,
corresponding to an object's training views, to
synthesize an approximation of a multivariate function (Poggio
and Girosi, 1990) representing the object.

In such a network a view can be represented by a set
of any image features, such as the orientations or
positions of object parts, shape metrics, texture, or color.
Complex features can be created hierarchically from
simpler ones as shown in Figure 1. The performance of the
network was tested with geometrical features like the
position of the vertices of wire-objects (Poggio & Edelman,
1990), or their orientations (Logothetis et al., 1994), or
with features extracted from real images of wire-objects
(Brunelli and Poggio, 1991b) or faces (Brunelli and
Poggio, 1991a). The actual features used by a
biological recognition system are presently unknown and their
nature is an important experimental question per se.
Nonetheless, some of the arbitrary features used in the
simulations can provide a measure of object similarity.
Figure 1: (a) Performance of a regularization network trained with the 0°, 60°, 120°, and 180° views of a
wire-object. Each “hidden-layer” unit takes a similarity measure between a novel view and a template stored in the
unit’s memory, by calculating the Euclidean distance $\|V - T_i\|$ of the input vector $V$ from its learned view $T_i$, and
subsequently computing the function $u_i(V) = \exp(-\|V - T_i\|^2)$ of this distance. The activity of the entire network
is conceived of as the weighted sum of each unit’s output ($F(V) = \sum_i c_i \exp(-\|V - T_i\|^2)$). A decision criterion
can be applied for yes/no type of performance. The basic scheme can be hierarchically used for composing complex
features out of simpler ones (small inset).
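
The computation described in the caption can be sketched in a few lines. The following is only an illustration under stated assumptions: the two-dimensional feature vector produced by `view_features`, the equal weights, and the decision threshold are hypothetical choices made for the example, not values taken from the simulations cited in the text.

```python
import math

def rbf_unit(view, template):
    """Response exp(-||V - T||^2) of one hidden-layer unit."""
    sq_dist = sum((v - t) ** 2 for v, t in zip(view, template))
    return math.exp(-sq_dist)

def network_output(view, templates, weights):
    """Weighted sum F(V) = sum_i c_i * exp(-||V - T_i||^2)."""
    return sum(c * rbf_unit(view, t) for c, t in zip(weights, templates))

def view_features(angle_deg):
    """Hypothetical feature vector for a view at a given rotation angle."""
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

# Four stored training views, as in the caption; equal weights are an assumption.
templates = [view_features(a) for a in (0, 60, 120, 180)]
weights = [1.0, 1.0, 1.0, 1.0]

def recognizes(angle_deg, threshold=1.0):
    """Yes/no decision obtained by thresholding the network activity."""
    return network_output(view_features(angle_deg), templates, weights) >= threshold
```

With these assumed features, views lying between the stored templates (for example 90°) exceed the criterion, while a view far from all templates (for example 270°) does not.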

Based on such features, simple simulations argue against
the implausibility of a view-based recognition system.

Also in agreement with the basic idea that a limited
number of views might be sufficient to accomplish
view-invariance, are recent psychophysical experiments
showing that human subordinate-level recognition
performance can be best predicted by assuming that
subjects interpolate between familiar object views (Bülthoff
[...] view and gradually worse for views with increasing
distance from the known view. Familiarity with two views
of an object allowed the interpolation of recognition
between the views if they were close enough together, say
75° apart, but resulted in two independent regions of
generalization if they were far apart, say 160°. In most
cases, however, only three to five familiar views were
needed for the animal to achieve view-invariant
performance around one axis.

A recognition architecture that could underlie such
performance might rely on small-scale networks with
units that are broadly tuned to views or features of a
learned object. Neurons responding to complex 2D
patterns, including face or hand views (Gross et al., 1972;
Bruce et al., 1981; Rolls, 1984; Desimone et al., 1984;
Yamane et al., 1988), have indeed been reported in
inferotemporal cortex of the monkey by different researchers
(Richmond et al., 1987; Miyashita, 1988; Tanaka et al.,
1991; Fujita et al., 1992). Such cells discharge more
strongly to complex patterns than to any simple
stimulus, and are found even in the earliest stages of ontogeny
of the primate (Rodman et al., 1993). A detailed
investigation of the cells showing high selectivity for faces has
revealed several different types or classes of neurons in
the superior temporal sulcus, each broadly tuned to one
view of the head, e.g. full face or profile (Perrett, 1985).
[...] any novel object as a result of extensive training?

Clinical observations have shown that the recognition
of living things can be selectively impaired (Farah et al.,
1991). This may imply that the perception of faces or
biological forms in general is mediated by specialized
neural populations. If so, then the complex-pattern
selectivity (faces, body parts, etc.) reported in the above
studies may be unique to the representation of the class
of “living things”, with different encoding mechanisms
responsible for the recognition of other objects. In
general, objects may be represented by large populations
of cells each encoding a simple feature, or the
conjunction of simple features that are characteristic for a given
class. Alternatively, a system based on neurons
selective for complex configurations may provide one
mechanism for encoding any object that cannot undergo much
meaningful decomposition in the course of recognition.
Some subordinate categorizations cannot rely on part
decomposition. We are unlikely to recognize individual
faces, for example, by simply detecting the existence of
two eyes, the nose and the mouth, as each individual
is likely to have the same parts in approximately the
same positions. It is a holistic and/or a metric
representation that probably underlies the recognition of a
person’s face. The same reasoning may apply for the
recognition of individual objects of other classes,
particularly artificial objects composed of similar parts. Thus,
the question arises: If monkeys are extensively trained
to identify novel 3D objects of a class whose members
show a great deal of structural similarity, then would
one find neurons in the brain which respond selectively
to the views of such objects?

We have examined this possibility using two classes
of novel, computer-rendered stimuli: Gouraud-shaded
wire-like and amoeboid objects (Bülthoff & Edelman,
1992; Edelman & Bülthoff, 1992; Logothetis et al., 1994).
The monkeys were trained in a matching task,
generalized across translation, scaling and orientation changes.
Within an object class the target-distractor similarity
varied between one extreme, where distractors were
generated by randomly selecting shape-parameters, such as
the positions of vertices or protrusions, the sharpness of
angles between segments, or the moment of inertia of the
objects, and the other extreme, where distractors were
generated by adding different degrees of noise to the
parameters of the target. A variety of other digitized 2D
or 3D patterns, e.g., geometric objects, scenes, and body
parts, were also used as controls in the physiological
experiments.

2  Methods 

2.1  Subjects  and  Surgical  Procedures 

Two juvenile rhesus monkeys (Macaca mulatta)
weighing 7-9 kg were tested in the electrophysiological studies.
The animals were cared for in accordance with the
National Institutes of Health Guide, and the guidelines of
the Animal Protocol Review Committee of the Baylor
College of Medicine.

After preliminary training, each animal underwent
aseptic surgery, using isoflurane anesthesia (1.2% -
1.5%), for the placement of the head restraint post and
the scleral search eye coil. Throughout the surgical
procedure the heart rate, blood pressure and respiration
were monitored constantly and recorded every 15
minutes. Body temperature was kept at 37 °C using a
heating pad. Postoperatively, the monkey was
administered an opioid analgesic (Buprenorphine hydrochloride
0.02 mg/kg, IM) every 6 hours for one day, and Tylenol
(10 mg/kg) and antibiotics (Tribrissen 30 mg/kg) for
3-5 days. At the end of the training period another
sterile surgery was performed to implant a chamber for the
electrophysiological recordings.

2.2  Animal  Training 

Standard operant conditioning techniques with positive
reinforcement were used to train the monkey to perform
the task. Initially, the animals were trained to recognize
a target's zero view among a large set of distractors.
When they had learned the zero view they were
encouraged to generalize recognition to neighboring views
resulting from progressively larger rotations around one
axis. The criterion required before training with another
object was 95% correct over a range of ±90° for the
target, and less than 5% false alarm rate for all distractors.
In the early stages of training several days were required
to train the animals to perform the same task for a new
object. Four months of training was required on average
for the monkey to learn to generalize the task across
different types of objects of one class, and about six months
were required for the animal to generalize for different
object classes.

The similarity of the targets to the distractors was
gradually increased within an object class. In the
final stage of the experiments distractor wire-objects were
generated by adding different degrees of position or
orientation noise to the target objects. A criterion of 95%
correct for several objects was required to proceed with
the psychophysical data collection.
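
Generating a distractor by perturbing the target can be illustrated as follows. This is a minimal sketch, assuming that position noise is zero-mean Gaussian and is added independently to every vertex coordinate; the noise distribution, the example vertices, and the function names are hypothetical and are not specified in the text.

```python
import math
import random

def make_distractor(target_vertices, noise_sd, rng):
    """Copy the target's vertices, adding Gaussian position noise to each coordinate."""
    return [tuple(c + rng.gauss(0.0, noise_sd) for c in vertex)
            for vertex in target_vertices]

def mean_displacement(a, b):
    """Average Euclidean distance between corresponding vertices of two objects."""
    total = 0.0
    for p, q in zip(a, b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))
    return total / len(a)

# Hypothetical target wire with four vertices.
target = [(0.0, 0.0, 0.0), (1.0, 0.5, -0.2), (0.3, 1.1, 0.8), (-0.6, 0.4, 1.2)]

# A small noise level yields a distractor that closely resembles the target;
# a large one yields a clearly different object.
similar = make_distractor(target, 0.05, random.Random(0))
different = make_distractor(target, 0.5, random.Random(0))
```

Raising `noise_sd` lowers target-distractor similarity, which is one way to picture the graded difficulty described above.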

In the initial training phase, the animal received
continuous feedback about its performance. Each correct
response was rewarded with a drop of juice. In the later
stages of the training the animals were reinforced on a
variable-ratio schedule which administered a reward
after a specified average number of correct responses had
been given. Finally, in the last stage of the behavioral
training the monkey was rewarded only after ten
consecutive correct responses. The end of the observation
period was signalled with a full-screen, green light and a
juice reward for the monkey. The variable-ratio schedule
was also used throughout the period of psychophysical
data collection.
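
A variable-ratio schedule of the kind described above can be sketched as a small counter. The text specifies only that a reward follows a specified average number of correct responses; drawing the required count uniformly from 1 to 2 × mean − 1 is an assumption made here so that the expected count equals the mean, and the class name is hypothetical.

```python
import random

class VariableRatioSchedule:
    """Deliver a reward after a randomly varying number of correct responses."""

    def __init__(self, mean_ratio, rng):
        self.mean_ratio = mean_ratio
        self.rng = rng
        self._remaining = self._draw()

    def _draw(self):
        # Uniform on 1 .. 2*mean_ratio - 1, so the expected value is mean_ratio
        # (the uniform distribution is an assumption of this sketch).
        return self.rng.randint(1, 2 * self.mean_ratio - 1)

    def record_correct(self):
        """Register one correct response; return True if it earns a reward."""
        self._remaining -= 1
        if self._remaining == 0:
            self._remaining = self._draw()
            return True
        return False
```

Setting `mean_ratio` to 1 reduces the schedule to the continuous reinforcement used in the initial training phase, since every correct response is then rewarded.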

During the behavioral training, independent of the
reinforcement schedule, the monkey always received
feedback as to the correctness of each response. Incorrect
reports aborted the entire observation period. During
psychophysical data collection, on the other hand, the
monkey was presented with novel objects and no
feedback was given during the testing period. The behavior
of the animals was monitored continuously during the
data collection by computing on-line hit rate and false
alarms. Arbitrary performance or the development of
hand-preferences, e.g. giving only right hand responses,
was discouraged during psychophysical data collection
by randomly interleaving sessions of actual data
collection with sessions in which a novel object was presented
but correct performance was required of the animal (i.e.,
incorrect responses resulted in aborts).

In the electrophysiological experiments the animal
was required to maintain fixation throughout the
entire observation period. Eye movements were measured
using the scleral search coil technique and digitized at
200 Hz.

2.3  Electrophysiological  recording 

Recording of single unit activity was done using
Platinum-Iridium electrodes of 2-3 Megohms impedance.
The electrodes were advanced into the brain through
a guide tube mounted into a ball-and-socket positioner
(Monkey S5396: AP = 15, L = 22; Monkey B63A:
AP = 19, L = 22). By swivelling the guide tube,
different sites could be accessed within an approximately
10 x 10 mm cortical region. Action potentials were
amplified (Bak Electronics, Model 1A-B), and routed to an
audio-monitor (Grass AM-8) and to a time-amplitude
window discriminator (Bak Model DIS-1). The output
of the window discriminator was used to trigger the
real-time clock interface of a PDP 11/83 computer.

Figure 2: The experimental paradigm. Each observation period began with the presentation of a fixation spot.
Successful fixation was followed by the learning phase, after which up to ten single, static views of either the target
or a distractor were presented sequentially (testing phase). The subject was required to respond to each one in turn,
indicating a choice of “target” by pressing the right lever or “distractor” by pressing the left lever. Fixation was
maintained for the duration of the observation period.

2.4  Visual  stimuli 

The visual objects were presented on a monitor situated
97 cm from the animal. The selection of the vertices of
the wire objects within a three-dimensional space was
constrained to exclude intersection of the wire-segments
and extremely sharp angles between successive segments,
and to ensure that the difference in the moment of
inertia between different wires remained within a limit of
10%. Once the vertices were selected the wire objects
were generated by determining a set of rectangular facets
covering the surface of a hypothetical tube of a given
radius that joined successive vertices.
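
The constrained vertex selection described above can be pictured as rejection sampling. The sketch below is an assumption-laden illustration: the sampling cube, the number of vertices, the minimum joint angle, and the treatment of vertices as unit point masses when computing the moment of inertia are all hypothetical, and the check for intersecting segments is omitted for brevity.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def joint_angles(vertices):
    """Angle in degrees between the two segments meeting at each interior vertex."""
    angles = []
    for prev, mid, nxt in zip(vertices, vertices[1:], vertices[2:]):
        u = [p - m for p, m in zip(prev, mid)]
        w = [n - m for n, m in zip(nxt, mid)]
        cos_a = sum(a * b for a, b in zip(u, w)) / (norm(u) * norm(w))
        angles.append(math.degrees(math.acos(max(-1.0, min(1.0, cos_a)))))
    return angles

def moment_of_inertia(vertices):
    """Inertia about the centroid, with vertices as unit point masses (a simplification)."""
    n = len(vertices)
    centroid = [sum(v[i] for v in vertices) / n for i in range(3)]
    return sum(sum((v[i] - centroid[i]) ** 2 for i in range(3)) for v in vertices)

def sample_wire(n_vertices, min_angle_deg, reference_inertia, rng, max_tries=10000):
    """Draw random vertices until the angle and inertia constraints are both met."""
    for _attempt in range(max_tries):
        vertices = [tuple(rng.uniform(-1.0, 1.0) for _axis in range(3))
                    for _vertex in range(n_vertices)]
        if min(joint_angles(vertices)) < min_angle_deg:
            continue  # reject extremely sharp angles between successive segments
        if abs(moment_of_inertia(vertices) - reference_inertia) > 0.10 * reference_inertia:
            continue  # keep the moment of inertia within 10% of the reference
        return vertices
    raise RuntimeError("no wire satisfied the constraints")
```

Because every accepted wire has a moment of inertia close to the same reference value, such a global property cannot by itself distinguish a target from its distractors.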

The spheroidal objects were created through the
generation of a recursively-subdivided triangle mesh
approximating a sphere. Protrusions were generated by
randomly selecting a point on the sphere's surface and
stretching it outward. Smoothness was accomplished by
increasing the number of triangles forming the
polyhedron that represents one protrusion. Spheroidal stimuli
were characterized by the number, sign (negative sign
corresponded to dimples), size, density and sigma of
the Gaussian type protrusions. Similarity was varied by
changing these parameters as well as the overall size of
the sphere.
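
One way to picture a Gaussian-type protrusion is as a radial displacement of the sphere's surface that decays with angular distance from the protrusion's center. The sketch below assumes this radial parametrization; it is not the mesh-based procedure described in the text, and the parameter names are hypothetical.

```python
import math

def surface_radius(direction, base_radius, protrusions):
    """Radius of the surface along a unit direction vector.

    Each protrusion is (center_direction, amplitude, sigma_radians);
    a negative amplitude produces a dimple.
    """
    radius = base_radius
    for center, amplitude, sigma in protrusions:
        cos_angle = sum(d * c for d, c in zip(direction, center))
        angle = math.acos(max(-1.0, min(1.0, cos_angle)))
        radius += amplitude * math.exp(-angle ** 2 / (2.0 * sigma ** 2))
    return radius

# One bump on the +Z pole and one dimple on the +X axis (illustrative values).
protrusions = [((0.0, 0.0, 1.0), 0.4, 0.3), ((1.0, 0.0, 0.0), -0.2, 0.3)]
```

Evaluating `surface_radius` at each vertex of a subdivided sphere mesh and moving the vertex to that radius would then give a deformed object whose similarity to another can be varied through the number, sign, size, and sigma of the protrusions.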

Test-views were typically generated by ±10 to ±180
degree rotations around the vertical (Y), horizontal (X),
or the two oblique (±45°) axes lying on the XY plane.

2.5  Data  Analysis 

Mean spike rates are distributed symmetrically; that is,
the mean is an accurate representation of central
tendency, coinciding with the median of the distribution.
The significance of differences between mean spike rates
measured during the target presentations and those
measured during the distractor presentations can therefore
be tested by using the non-parametric Walsh test for
two related samples (Walsh, 1949). For our sample
size (N = 9 presentations per target-view or
distractor), the power-efficiency (approximately the
percentage of the total available information per
observation that is utilized by the test) of the one-tailed
Walsh test at α = 0.011 is 98% of that of the
parametric t test at α = 0.05, while it avoids the use
of assumption-laden dispersion measures. The neurons
presented here as view-selective gave equal or greater
responses to target views than to the views of the
distractors, at α = 0.011 ($\min[d_3, \tfrac{1}{2}(d_1 + d_5)] > 0$,
where $d_1 \le \dots \le d_9$ denote the ordered differences between paired responses).
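
The decision rule quoted above is simple enough to state as code. The sketch below applies the criterion min[d_3, (d_1 + d_5)/2] > 0 to nine paired differences; taking each difference as the target rate minus the distractor rate, and the way presentations are paired, are assumptions of this illustration.

```python
def walsh_test_n9(target_rates, distractor_rates):
    """One-tailed Walsh test for nine paired observations at alpha = 0.011.

    Returns True when the responses to the target are significantly greater.
    """
    if len(target_rates) != 9 or len(distractor_rates) != 9:
        raise ValueError("this decision rule applies to N = 9 pairs only")
    # Ordered differences d_1 <= ... <= d_9 (zero-based: d[0] .. d[8]).
    d = sorted(t - r for t, r in zip(target_rates, distractor_rates))
    return min(d[2], 0.5 * (d[0] + d[4])) > 0
```

Note that a single negative difference does not necessarily prevent significance: the rule tolerates it as long as the average of the smallest and the fifth-smallest differences stays positive.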

3  Results 

3.1  View  selectivity 

Figure 2 describes the sequence of events that composes
a single observation period. An observation period
began with the presentation of a small fixation spot.
Successful fixation was followed by the learning phase,
during which the target was presented for 2 to 4 seconds
from one viewpoint. This view of the target, called the
training view, was presented in oscillatory motion ±15°
around a fixed axis at 0.67 Hz to provide the subject with
complete 3D structure information. The learning phase
was followed by a short fixation period after which the
testing phase started. A testing phase consisted of up
to 10 sequential trials, in each of which the test
stimulus, a static view of either the target or a distractor,
was presented. Thirty target views 12° apart and 60 to
120 distractors were tested in a given session. The
duration of stimulus presentation was 500-800 msec, and the
monkeys were given 1500 msec to respond by pressing
one of two levers: the right lever upon presentation of
a target view and the left upon presentation of a
distractor. Typical reaction times were below 1000 msec
for both animals. An experimental session consisted of
a sequence of 60 observation periods, each lasting about
25 seconds.

A total of 970 IT cells were recorded from two
monkeys during combined psychophysical and
electrophysiological experiments, in which the subject performed
either a fixation task, or the recognition task described
above. All data barring those shown in the last figure
were collected using objects that the monkeys could
recognize from any viewpoint (hit rate above 95% for all
views, and false alarm below 5% for all distractors). The
animals’ view-invariant performance in the case of these
objects was a result of training on multiple views, which
led to generalization around an entire axis, and of
eventually giving feedback for all views. A large majority of
the isolated neurons were visually active when plotted
with a variety of simple or complex stimuli, including
some of the wire or spheroidal objects. Other neurons
were inhibited by the presentation of target objects, and
a small fraction of cells were inhibited by any stimulus
including the fixation spot.

A number of units, however, responded selectively to a
subset of views of one of the known target objects, firing
much less or not at all for the distractors. The response
of these neurons for different views was approximated by
fitting to the data a Gaussian function centered on the
view eliciting the greatest response. If a cell responded
to two subsets of views, as was the case for several cells,
the linear sum of two Gaussian functions, one centered on
each “most effective” view, was used to fit the response.
The standard deviation of these functions, which can be
viewed as a measure of the generalization field of the cell,
was used to classify the neurons based on the following
criterion. Cells (N = 61) were considered selective if they
responded significantly more to target views within two
standard deviations of the preferred view than to any
of the distractors (see Methods).
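
The fitting step can be illustrated with a small least-squares routine. This is a sketch under stated assumptions: the text says only that a Gaussian centered on the most effective view was fitted; the grid search over sigma, the closed-form amplitude, the absence of a baseline term, and the neglect of angular wrap-around are choices made here for illustration.

```python
import math

def fit_gaussian_tuning(angles, rates, sigmas=None):
    """Fit rate = amplitude * exp(-(angle - center)^2 / (2 * sigma^2)).

    The center is fixed at the view eliciting the greatest response;
    sigma is chosen by grid search and amplitude by least squares.
    """
    center = angles[max(range(len(rates)), key=rates.__getitem__)]
    if sigmas is None:
        sigmas = [0.5 * k for k in range(2, 361)]  # 1 to 180 degrees in 0.5 steps
    best = None
    for sigma in sigmas:
        basis = [math.exp(-((a - center) ** 2) / (2 * sigma ** 2)) for a in angles]
        amplitude = sum(b * r for b, r in zip(basis, rates)) / sum(b * b for b in basis)
        error = sum((r - amplitude * b) ** 2 for b, r in zip(basis, rates))
        if best is None or error < best[0]:
            best = (error, amplitude, sigma)
    _, amplitude, sigma = best
    return center, amplitude, sigma

def views_in_generalization_field(angles, center, sigma):
    """Views within two standard deviations of the preferred view."""
    return [a for a in angles if abs(a - center) <= 2 * sigma]
```

The views returned by `views_in_generalization_field` are the ones whose responses would be compared against the distractor responses under the selectivity criterion stated above.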

An example of a view-selective neuron is shown in
Figure 3a. The cell's firing rate reached a maximum upon
presentation of one particular object view and declined
as the object was rotated away from this preferred view.
Figure 3b shows sixteen out of the 60 tested distractor
wire-objects and an associated histogram of the response
each elicited. The within-class recognition task the
animal was performing during the electrophysiological
experiments provided an internal control against common
or trivial features being responsible for the behavior of
the neurons. Examination of the views of the target
for which the cell is selective reveals a couple of features
that may be characteristic for that view of the target.
For example, the inverted “V” (circled) in the 0° view in
Figure 3a appears to be a prominent feature that all the
response-eliciting target views have in common. Could
the neuron simply be selectively firing for the presence
of this particular feature? This is not likely to be the
case, as an inverted “V” is also present in several of the
distractors (see the circled regions of distractors 18, 25,
44, 49, 50 in Figure 3b).

Similar  results  were  obtained  with  the  class  of 
spheroidal  objects  (Figure  4).  Here,  too,  the  neuron  re¬ 
sponds  maximally  to  one  view  of  the  object,  72°  away 
from  the  zero- view,  with  its  response  declining  as  the 
angle  of  rotation  deviates  in  either  direction  from  the 
preferred  view.  Figure  4b  shows  the  “best-response 
eliciting  distractors.  Although  all  views  of  the  target 
have  one  particular  protrusion  which  remains  visible  in 
all  views,  this  alone  does  not  seem  to  be  sufficient  to 
elicit  any  sort  of  response.  As  indicated  by  the  circled 
region  of  view  “72°”,  all  of  the  views  eliciting  a  signifi¬ 
cant  response  share  the  presence  of  a  “face-like”  region 
containing  two  dimples  and  a  small  protrusion  in  the 
lower  right.  However,  similar  regions  are  also  present  in 
two  of  the  distractors,  12  and  14  in  the  bottom  half  of 
the figure, and neither of these elicits any activity from
the  cell  whatsoever. 

The  generalization  field  of  a  number  of  view-selective 
neurons  was  examined  for  all  rotations  in  depth  using 
views  neighboring  the  preferred  view  along  all  four  axes. 
An  example  is  shown  in  Figure  5a.  This  cell  responded 
best  to  the  0°  view  of  the  object  and  its  response  mag¬ 
nitude  decreased  with  increasing  angle  of  rotation  along 
all  axes.  A  small  percentage  of  the  view-selective  cells 
(5  out  of  61)  exhibited  their  maximum  discharge  rate  for 
two  views  180  degrees  apart  (Figure  5b).  The  same  pat¬ 
tern  was  observed  in  the  behavioral  performance  of  the 
monkeys  for  several  objects  (Logothetis  et  al.,  1994).  In 
both  cases,  this  type  of  response  was  specific  to  wire-like 
objects  whose  zero  and  180°  views  appeared  as  mirror- 
symmetrical  images  of  each  other,  due  to  accidental  min¬ 
imal  self-occlusion. 

Figure  6  shows  the  distribution  of  the  generalization 
fields  of  view-selective  cells  for  the  wire-like  and  the 
spheroidal  objects.  The  insets  show  the  coefficients  of  de¬ 
termination  indicating  the  goodness  of  fit.  Both  object- 
types  gave  similar  tuning  width,  which  was  always  less 
than  or  equal  to  the  behavioral  generalization  field  of 
monkeys  trained  with  one  view  of  similar  objects  (Logo¬ 
thetis  et  al.,  1994). 
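
As a rough illustration of how a tuning width and its coefficient of determination might be obtained, the sketch below fits a single Gaussian plus a baseline to a view-tuning curve, in the spirit of the Gaussians mentioned in the caption of Figure 6. It is not the fitting procedure used in the study: the grid search over candidate widths, the function names, and the data are all assumptions made for illustration, and the amplitude, preferred view, and baseline are taken as known.

```python
import math

def gaussian_tuning(theta, c, theta_pref, sigma, r0):
    """Single-Gaussian view-tuning model: amplitude c, width sigma, baseline r0."""
    return c * math.exp(-((theta - theta_pref) ** 2) / (2.0 * sigma ** 2)) + r0

def coefficient_of_determination(observed, predicted):
    """R^2 = 1 - SS_res / SS_tot, the usual goodness-of-fit measure."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def fit_sigma(views, rates, c, theta_pref, r0, candidates):
    """Grid search (an assumed stand-in for a nonlinear fit) over candidate widths.

    Returns the (sigma, R^2) pair with the highest R^2.
    """
    best = None
    for sigma in candidates:
        predicted = [gaussian_tuning(v, c, theta_pref, sigma, r0) for v in views]
        r2 = coefficient_of_determination(rates, predicted)
        if best is None or r2 > best[1]:
            best = (sigma, r2)
    return best

# Hypothetical, noise-free data: views every 12 degrees, true width of 25 degrees.
views = list(range(-180, 181, 12))
rates = [gaussian_tuning(v, 30.0, 0.0, 25.0, 5.0) for v in views]
sigma, r2 = fit_sigma(views, rates, c=30.0, theta_pref=0.0, r0=5.0,
                      candidates=range(5, 61, 5))
```

The fitted `sigma` plays the role of the tuning width whose distribution is plotted in Figure 6, and `r2` corresponds to the coefficient of determination shown in the insets. The sketch treats rotation angle as a linear variable and ignores wrap-around at ±180°.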

A  number  of  the  objects  used  extensively  during  the 
training  of  the  animal  were  also  used  during  the  electro¬ 
physiology  sessions.  For  several  of  these  objects,  multi¬ 
ple  neurons  were  found  that  were  selective  for  different 
views of the same object. Figures 7a through 7d illustrate
such a case for four units. Three out of the 970
cells  responded  selectively  to  specific  objects  presented 
from  any  viewpoint.  Figure  7e  shows  such  a  neuron  that 
appears to have properties of object-centered descriptions.
The  cell  responds  about  equally  well  for  all  target  views 
and  significantly  less  to  any  of  the  120  distractors. 
Figure 3: View-selective response of an IT neuron for a wire-like object. Peristimulus histograms (PSTHs) show the
activity of a view-selective neuron when (a) the target or (b) distractors were presented. The ordinate and abscissa,
labeled in the lower left, are the same for both the upper and lower sets of histograms. The insets show the target
and the distractor views. The boxed plot is the zero view, presented in the learning phase. Note that the activity of
the neuron for a given target view is well above that for distractors up to ±36° from the preferred view, defining the
generalization field of the neuron. The dashed circles in the upper half (0° view) and in the lower half (distractors
18, 25, 44, 49, 50) of the figure serve to highlight one of the features, an inverted “V”, which all of these images have
in common (see text).
Figure 5: (a) Response of a view-selective neuron to rotations around the preferred view along four axes. The
z-dimension of the plot is spike rate and the x and y dimensions show the degrees of rotation of the target object
along either or both of these axes. The volume was generated by testing the cell’s response for rotations out to ±60°
around the x and y axes as well as along the two diagonals. The magnitude of response fell off about equally for
rotations away from 0° along all of the axes tested. The activity of the neuron for the 60 distractors is shown in
the inset. (b) Response of a neuron selective for pseudo-mirror-symmetric views, 180° apart, of a wire-like object.
The filled circles are the mean spike rates for target views around one axis of rotation. The solid black line is a
DWLS-smoothed view-tuning curve. The two inset images depict the −120° and 60° views around both of which the
neuron showed view-selective tuning. The activity of the neuron for the 60 different distractor objects used during
testing is shown in the inset gray box.
Figure 6: Distribution of the standard deviation of the Gaussians fitted to the view-tuning curves of IT neurons for
the wire-like (a) and the amoeba (b) objects. The black bars in both plots represent the 61 view-selective neurons.
The gray bars show the three units that responded in a view-invariant manner for a given object. The insets show
the coefficients of determination, indicating the goodness of the fit.
Figure 7: (a) - (d) View-selective responses of neurons tuned to different views of the same wire-object. All data
come from the same animal (S5396). The filled circles are the mean spike rates (N = 10), and the thin black lines
DWLS-smoothed view-tuning curves. The thick gray lines are a nonlinear approximation of the data (QNMT) with
the function $R(\theta) = \sum_{i=1}^{N} c_i \exp\left(-\|\theta - \theta_i\|^2 / 2\sigma_i^2\right) + R_0$, where N = 1 or 2. (e) An example of a neuron showing
view-invariant response for a known wire object. The behavioral performance of the monkey for this object was
view-independent due to its having been used as a training object (see text). The insets in (a) through (e) show the
activity of the neuron for the 60 or 120 distractors used during testing.
3.2  Translation  and  scale  invariance

Among  the  population  of  neurons  examined,  we  could 
identify a number of units that showed a large degree of
size invariance. Figure 8 is an example of a view-selective
neuron the response of which was found to be invariant
to changes in size. Whether the stimulus subtended one
degree of visual angle or six degrees, the magnitude of the
cell’s response was the same. Note that the fixation spot,
the  only  unchanging  part  of  the  stimulus,  did  not  elicit  a 
response  from  the  cell  during  the  first  500ms  of  the  trial 
before  the  stimulus  onset.  Figure  9  shows  the  response 
of  the  same  cell  when  tested  for  positional  invariance.  In 
this  case  the  center  of  the  stimulus  was  translated  7.5 
degrees from the fixation spot. With the exception of
the  brief  on-transient,  the  cell’s  activity  does  not  deviate 
from  the  baseline  for  all  tested  positions.  Thus,  this  cell, 
while  scale  invariant,  appears  to  be  position  dependent 
for  relatively  large  displacements.  The  responses  shown 
in  Figures  8  and  9  were  collected  during  a  simple  fixation 
task. 

The responses of eight view-selective neurons were
tested for scale and translation invariance in the context
of the object recognition task using the preferred view
of the object. The stimulus sizes used subtended from
1.9 to 5.6 degrees of visual angle, and all positions
tested were at a radial distance of 3.15 degrees. An example
of a view-selective neuron responding invariantly to
changes in both size and position is shown in Figure 10.

This  particular  cell  was  selective  when  a  limited  re¬ 
gion  of  the  object  around  120  degrees  (Figure  10a)  was 
presented,  and  responded  3.5  times  more  for  the  pre¬ 
ferred  target  view  than  for  the  best  distractor  (Figure 
10b).  Responses  to  scaling  and  translation  were  tested 
using  the  preferred  view.  Figure  10c  shows  the  ratio 
of  the  target  response  to  the  mean  response  for  the  ten 
best  distractors  for  the  sizes  tested.  Note  that  all  of  the 
distractors  were  of  the  default  size  and  were  presented 
foveally.  The  responses  of  the  same  cell  to  translation 
are plotted in Figure 10d. This particular neuron showed
some variance in its response depending on stimulus position;
however, in all cases its response for an eccentrically
presented target was still at least twice that for
foveally  presented  distractors.  Seventy-five  percent  of 
the  tested  neurons  gave  only  scale-invariant  responses 
while  35%  were  invariant  for  both  scale  and  position. 
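
The response measure plotted in Figures 10c and 10d, the target response divided by the mean response to the ten best distractors, can be written down in a few lines. The sketch below is only an illustration of that ratio: the function name and the spike rates are hypothetical, and nothing here reproduces the recorded data.

```python
def selectivity_ratio(target_rate, distractor_rates, n_best=10):
    """Ratio of the target response to the mean of the n_best largest distractor responses."""
    best = sorted(distractor_rates, reverse=True)[:n_best]
    if not best:
        raise ValueError("at least one distractor response is required")
    mean_best = sum(best) / len(best)
    if mean_best == 0:
        raise ValueError("mean distractor response is zero; ratio is undefined")
    return target_rate / mean_best

# Hypothetical mean spike rates (spikes/s): 60 distractors, ten of which drive the cell.
distractor_rates = [2.0] * 50 + [10.0] * 10
ratio = selectivity_ratio(40.0, distractor_rates)
```

Under this measure, a value of at least 2 for an eccentrically presented target corresponds to the criterion described above, namely a response at least twice that for foveally presented distractors.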

3.3  Responses  to  rotations  in  the  view  plane 

Neurons  were  also  tested  for  rotation  in  the  view  plane. 
Most units appeared to be orientation selective (Figure
11b). However, the initial performance of the animal
also  appeared  to  be  orientation  dependent  for  any  given 
novel  object  rotated  in  the  view  plane  (Figure  11a).  In 
almost  all  cases,  however,  the  initial  generalization  field 
for  picture-plane  rotations  appears  to  be  broader  than 
that typically obtained for rotations in depth (Logothetis
et al., 1994). Figure 11c illustrates the behavioral
progression  of  one  animal’s  recognition  performance  as 
it  evolved  from  initially  view-dependent  to  almost  com¬ 
pletely  view-invariant  for  two  different  objects.  Gener¬ 
alization  performance  often  progressed  rapidly,  over  the 
course of a few test sessions, to view-invariant performance.
This is in strong contrast to the view-dependent
performance  seen  for  rotations  in  depth,  which  changed 
very  little  for  the  duration  of  testing  (as  many  as  fifteen 
sessions  without  feedback). 

4  Discussion 

The results of this study suggest an experience-dependent
plasticity in IT neurons, and support the idea of
a population of neurons with configurational selectivity
being a more general mechanism for encoding complex,
“non-decomposable” objects. The neurons dis¬
cussed  above  responded  selectively  to  novel  objects  that 
the  monkey  had  recently  learned  to  recognize.  None 
of  these  objects  had  any  prior  meaning  for  the  animal, 
nor  did  they  resemble  anything  familiar  in  the  monkey’s 
environment.  View-selective  responses  were  found  for 
both  object  types  tested  and  were  not  limited  to  any 
one single region of an object. However, when cells
were tested with objects that the monkey could rec¬
ognize  only  from  a  specific  viewpoint,  no  selective  re¬ 
sponses  were  ever  encountered  for  views  that  the  an¬ 
imal  systematically  failed  to  recognize.  The  reported 
cell  responses  are  unlikely  to  reflect  a  general  sensa¬ 
tion  of  familiarity  or  arousal,  since  the  majority  of  the 
neurons  responded  selectively  to  a  subset  of  the  tested 
object-views, even when the animal’s recognition per¬
formance  was  view-invariant  (as  in  all  cases  except  in 
Figure  11).  Thus  it  seems  that  neurons  in  this  area  may 
develop  complex,  configurational  selectivity  as  the  ani¬ 
mal  is  trained  to  recognize  specific  objects.  Such  neu¬ 
rons  can  be  regarded  as  “blurred-templates”,  the  tol¬ 
erance  of  which  to  small  rotations  in  depth  represents 
a  form  of  limited  generalization.  The  capacity  of  some 
IT  neurons  to  respond  to  both  an  object  view  and  its 
“pseudo-mirror-symmetrical”  view  can  be  viewed  as  a 
broader  form  of  generalization,  possibly  underlying  the 
reflection-invariance  observed  during  the  psychophysical 
experiments  (Logothetis  et  al.,  1994).  Distinguishing 
mirror  images  has  no  apparent  usefulness  to  any  animal, 
and  the  inability  of  normal  children  to  distinguish  be¬ 
tween  mirror-symmetrical  letters  or  words  (Orton,  1928; 
Corballis  and  McLaren,  1984)  may  be  an  adaptive  mode 
of  processing  visual  information,  and  not  a  “confusion” 
(Bornstein  et  al.,  1978;  Gross  and  Bornstein,  1978).  In 
fact,  theoretical  and  psychophysical  work  suggests  that 
reflection-invariance  facilitates  the  recognition  of  bilater¬ 
ally  symmetric  visual  objects  (Vetter  et  al.,  1994).  Inter¬ 
estingly,  neurons  responding  to  mirror-images  of  a  face 
appear  very  early  in  the  visual  system  of  the  monkey 
(Rodman  et  al.,  1993). 

A significant number of neurons showed response invariance
to affine image transformations. Similar response
behavior has been reported earlier for 2D patterns
like the Fourier descriptors (Schwartz et al., 1983)
and for faces (Desimone et al., 1984; Rolls and Baylis,
1986;  Tovee  et  al.,  1994).  In  our  sample,  position  in¬ 
variance  varied  from  one  extreme,  where  response  was 
strongly  reduced  with  small  translation  (often  less  than  2 
degrees),  to  the  other  extreme  where  response  remained 
largely invariant for eccentricities up to 7.5 degrees.

Surprising  was  the  degree  of  view-dependency  of  the 
Figure 8: Response invariance to changes in size in a view-tuned neuron. The monkey was performing a simple
fixation  task  in  which  each  trial  lasted  2500ms.  PSTHs  show  the  activity  of  the  neuron  over  the  course  of  a  trial. 
The  ordinate  is  spike  rate  and  the  abscissa  is  time.  The  animal  fixated  without  a  stimulus  for  the  first  500ms  at  which 
point  a  stimulus  would  appear  (indicated  by  the  dashed  line),  and  it  continued  to  fixate  for  2000ms,  responding  to  a 
change  in  fixation  spot  color  at  the  end  of  the  trial.  Each  stimulus  is  shown  to  the  side  of  its  respective  histogram. 
The  circled  stimulus  is  the  one  used  for  testing  view-selectivity. 
Figure 9: Responses to translation of an object in the picture-plane. Data are from the cell presented in Figure 7.
The activity of the neuron for the default wire presented foveally (shown in Figure 7) is represented here by the
black  histogram  in  the  background  of  each  plot.  The  gray  PSTHs  show  the  activity  of  the  cell  for  the  eight  positions 
tested.  In  each  case  the  center  of  the  wire  was  translated  7.5  degrees  from  the  central  fixation  spot.  Other  than  a 
short  transient  of  activity,  cell  activity  is  barely  distinguishable  from  baseline  when  the  stimulus  is  presented  at  each 
of  the  eccentric  positions.  For  smaller  translations  (less  than  2  degrees),  however,  no  such  position  dependence  was 
observed. 
Figure 10: A view-selective neuron responding invariantly to changes in size and position. (a) Tuning curve
showing activity of the neuron for a limited region of the object. The preferred view corresponds to a 120° rotation
of the object around the Y-axis. (b) The responses of the cell for the ten best distractors. Distractors were always
presented  foveally  and  at  the  default  size.  The  best  target  view  was  used  to  examine  the  cell’s  response  to  changes 
in  size  (c)  and  position  (d).  The  response  of  the  cell  is  plotted  in  both  graphs  as  a  ratio  of  the  mean-spike-rate  for  a 
target  view  to  the  mean  of  the  mean-firing  rates  for  the  top  ten  distractors.  The  bar  representing  the  response  to  the 
default size is indicated by the asterisk in (c). The smallest size, 1.9°, was used to test translation. The ordinate of
the  graph  indicates  the  position  of  each  test  image  in  terms  of  its  azimuth  and  elevation. 
Figure 11: View-dependent behavioral performance and view-selective neuronal response for an image rotated in the
picture-plane. (a) Performance of the animal in terms of hit rate (N = 9 trials per view). In this example, no training
was given for the zero view prior to testing. (b) The plot depicts the view-tuning curve of the neuron, in terms of
mean-spike-rate. The abscissa of both plots is rotation angle. (c) Improvement of performance for recognition of
views resulting from view-plane rotations. The X-axis is rotation angle, the Y-axis increasing session number, and
the Z-axis hit rate. One test session included ten presentations of each target view, thirty-six in all, spaced at ten
degree intervals. Each curve, starting in the front and proceeding to the back, illustrates the performance over two
test sessions (N = 20 presentations of each target view). The animal was familiarized with the zero-view of the
object  during  one  brief  training  session  prior  to  testing.  No  feedback  was  given  during  the  testing  periods  as  to  the 
correctness  of  the  response. 
cell and the monkey responses for rotations in the plane
of view. Psychophysical studies in humans have revealed
that the recognition of objects rotated in the picture-plane
is different from the recognition of objects rotated
in depth. For example, Tarr and Pinker (Tarr &
Pinker, 1989, 1990; Tarr and Pinker, 1991) studied the
effects  of  rotation  in  the  picture  plane  on  recognition  and 
found  that  familiarization  with  one  view  of  an  object  re¬ 
sults  in  view-independent  performance,  although  reac¬ 
tion  times  do  increase  with  deviation  from  the  learned 
view.  This  performance  can  be  altered  by  training  the 
subjects  briefly  on  a  second  view,  resulting  in  an  im¬ 
provement  in  performance  around  the  new  learned  view 
and  to  a  lesser  extent  for  those  views  between  the  two 
familiar  views.  In  our  experiments,  the  behavior  of  the 
monkeys  was  initially  strongly  view-dependent  in  terms 
of  error  rate.  In  contrast  to  the  recognition  performance 
observed  for  rotations  of  the  object  in  depth,  however, 
hit  rate  for  view-plane  rotations  increased  gradually  over 
successive  sessions  without  any  feedback  to  the  animal 
as  to  the  correctness  of  its  response.  No  neuron  was  iso¬ 
lated  long  enough  to  observe  any  possible  changes  at  the 
single-cell level.

A  question  that  arises  from  these  results  is:  are  such 
neurons  really  responding  to  the  “views”  of  the  tested 
objects?  Studies  by  Tanaka  and  his  colleagues  (Tanaka 
et  al.,  1991)  showed,  for  instance,  that  the  response  of 
many  neurons  to  complex  objects  can  be  mimicked  using 
simpler  forms  representing  regions  of  the  objects.  In  a 
similar  vein,  the  neurons  studied  here  could  be  respond¬ 
ing  to  a  reduced  set  of  features  of  the  wire  or  spheroidal 
objects  and  not  to  an  entire  view.  Two  observations 
seem to refute such an alternative. First, the neurons
were  tested  with  a  variety  of  simple  objects,  including 
geometric  patterns  of  different  orientations,  that  failed 
to  elicit  any  response.  Second,  the  presentation  of  be¬ 
tween  60  and  120  distractors  from  the  same  or  a  different 
object  class  served  as  a  selectivity-control  for  each  of  the 
targets.  Thus  in  the  case  of  the  wire-objects,  for  exam¬ 
ple, given the largely invariant responses of IT neurons
for  small  translations  (Tovee  et  al.,  1994),  the  distractors 
had  at  least  60  different  combinations  of  simple  features 
like  orientations,  angles,  or  terminations,  some  of  which 
were  highly  similar  to  those  comprising  the  target  ob¬ 
ject.  As  a  matter  of  fact,  several  cells  did  respond  to  the 
presentation of the target and to a number of distractor
objects, presumably excited by such simpler features.
However,  the  selective  cells  discussed  here  gave  minimal 
and  sometimes  no  response  for  distractor  objects,  even 
when  the  latter  shared  a  few  characteristic  regions  with 
the  target,  indicating  that  a  specific  organization  of  some 
features  was  required  for  eliciting  the  neuron’s  response. 

Nevertheless,  both  arguments  are  based  on  qualita¬ 
tive  observations,  and  what  we  present  here  as  “view- 
selectivity”  may  still  be  reducible  to  less  complex  fea¬ 
ture  constellations.  A  systematic,  mathematical  analy¬ 
sis  of  object-views  that  elicit  similar  neural  responses, 
and an attempt to develop algorithms for biologically
plausible image decomposition may provide an answer
to  the  selectivity  question,  and  this  is  the  focus  of  cur¬ 
rent  experiments. 


5  Conclusions 

Taken  together,  these  data  suggest  the  possibility  of  a 
recognition architecture similar to that schematically
described  in  Figure  1.  The  discharge  rate  of  many  IT 
neurons  was  found  to  be  a  bell-shaped  function  of  orien¬ 
tation  centered  on  a  preferred  view.  A  very  small  number 
of  neurons  exhibited  object-specific  but  view-invariant 
responses  that  might  be  the  result  of  the  convergence  of 
view-dependent  units  into  neurons  showing  characteris¬ 
tics  of  object-centered  descriptions.  The  input  of  each 
view-selective  unit  can  be  considered  as  the  conjunction 
of  simpler  features  extracted  at  earlier  stages  in  the  vi¬ 
sual  system.  The  variability  in  the  degree  of  response 
invariance during affine image transformations also hints
at a multilayer, possibly hierarchical architecture.
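
The proposed convergence of view-dependent units onto a view-invariant unit can be illustrated with a toy model. The sketch below assumes Gaussian view-tuned units with equal weights, evenly spaced preferred views, and a common tuning width; these choices, along with all names and numbers, are assumptions made for illustration and are not estimates from the recordings.

```python
import math

def view_unit(theta, theta_pref, sigma):
    """Gaussian view-tuned unit; the angular difference is wrapped to [-180, 180) degrees."""
    d = (theta - theta_pref + 180.0) % 360.0 - 180.0
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def object_unit(theta, preferred_views, sigma):
    """Hypothetical object-centered unit: equal-weight sum over its view-tuned inputs."""
    return sum(view_unit(theta, p, sigma) for p in preferred_views)

# Assumed layout: six view-tuned units spaced 60 degrees apart, each with sigma = 30 degrees.
preferred_views = [0.0, 60.0, 120.0, 180.0, 240.0, 300.0]
sigma = 30.0
test_views = range(0, 360, 10)

single = [view_unit(t, 0.0, sigma) for t in test_views]
pooled = [object_unit(t, preferred_views, sigma) for t in test_views]

# Flatness of each response profile across views (1.0 would be perfectly view-invariant).
single_flatness = min(single) / max(single)
pooled_flatness = min(pooled) / max(pooled)
```

With these assumed parameters the single unit is sharply view-dependent, whereas the pooled unit varies by only a few percent across all views. Whether view-invariant neurons in IT are actually built this way is not established by the data presented here.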

Such  a  scheme  is  obviously  oversimplified  and  lacks 
top-down  mechanisms  that  strongly  affect  recognition 
performance.  The  processing  of  object  information  is  un¬ 
doubtedly  far  more  complex,  and  representations  might 
be  local  and  explicit  or  distributed  and  implicit  accord¬ 
ing  to  the  recognition  task  or  the  stimulus  context.  Al¬ 
though  the  ultimate  goal  of  a  recognition  system  is  to 
describe  grouped  object-features  in  a  more  abstract  for¬ 
mat  that  captures  the  invariant,  three-dimensional,  geo¬ 
metric  properties  of  an  object,  early  representations  may 
be  in  some  cases  strongly  configurational.  Moreover,  for 
visually  complex,  non-decomposable  objects,  like  many 
biologically  meaningful  objects,  holistic  representations 
may  be  the  only  ones  possible.  Neurons  selective  for 
object-views  and  tolerant  of  varying  extents  of  image 
transformations  may  then  be  elements  of  one  possible 
mechanism  for  such  representations. 